Article 




OpenMP task scheduling strategies 
for multicore NUMA systems 


Stephen L Olivier 1 , Allan K Porterfield 2 , Kyle B Wheeler 3 , 
Michael Spiegel 2 and Jan F Prins 1 


The International Journal of High 
Performance Computing Applications 
1-15 

© The Author(s) 2012 
Reprints and permissions: 
sagepub.co.uk/journalsPermission.nav 
DOI: 10.1177/1094342011434065
hpc.sagepub.com 

SAGE


Abstract 

The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency
at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run-time system.
Efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an
increasingly complex memory hierarchy, including shared caches and non-uniform memory access (NUMA) characteristics.
In order to evaluate scheduling strategies, we extended the open source Qthreads threading library to implement
different scheduler designs, accepting OpenMP programs through the ROSE compiler. Our comprehensive performance
study of diverse OpenMP task-parallel benchmarks compares seven different task-parallel run-time scheduler
implementations on an Intel Nehalem multi-socket multicore system: our proposed hierarchical work-stealing scheduler, a per-core
work-stealing scheduler, a centralized scheduler, and LIFO and FIFO versions of the Qthreads round-robin scheduler. In
addition, we compare our results against the Intel and GNU OpenMP implementations.

Our hierarchical scheduling strategy leverages different scheduling methods at different levels of the hierarchy. By 
allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, the scheduler 
limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of 
cache locality between sibling tasks as well as between a parent task and its newly created child tasks. In the 
performance evaluation, our Qthreads hierarchical scheduler is competitive on all benchmarks tested. On five of the 
seven benchmarks, it demonstrates speedup and absolute performance superior to both the Intel and GNU OpenMP 
run-time systems. Our run-time also demonstrates similar performance benefits on AMD Magny Cours and SGI Altix
systems, enabling several benchmarks to successfully scale to 192 CPUs of an SGI Altix.

1 Introduction

Task-parallel programming models offer a simple way for 
application programmers to specify parallel tasks in a 
problem-centric form that easily scales with problem size, 
leaving the scheduling of these tasks onto processors to 
be performed at run time. Task parallelism is well suited 
to the expression of nested parallelism in recursive 
divide-and-conquer algorithms and of unstructured parallelism
in irregular computations.

An efficient task scheduler must meet challenging and
sometimes conflicting goals: exploit cache and memory
locality, maintain load balance, and minimize overhead
costs. When there is an inequitable distribution of work
among processors, load imbalance arises. Without redistribution
of work, some processors become idle. Load balancing
operations, when successful, redistribute the work
more equitably across processors. However, load balancing
operations can also contribute to overhead costs. Load
balancing operations between sockets increase memory
access time due to more cold cache misses and more
high-latency remote memory accesses. This paper proposes
an approach to mitigate these issues and advances
understanding of their impact through the following
contributions:

1. A hierarchical scheduling strategy targeting modern
multi-socket multicore shared memory systems
whose NUMA architecture is not well supported by
either work-stealing schedulers with one queue per core
or by centralized schedulers. Our approach combines
work stealing and shared queues for low-overhead load
balancing and exploitation of shared caches.

2. A detailed performance study on a current-generation
multi-socket multicore Intel system. Seven run-time
implementations supporting task-parallel
OpenMP programs are compared: five schedulers that
we added to the open source Qthreads library, the GNU
GCC OpenMP run-time, and the Intel OpenMP run-time.
In addition to speedup results demonstrating
superior performance by our run-time on many of the
diverse benchmarks tested, we examine several secondary
metrics that illustrate the benefits of hierarchical
scheduling over work-stealing schedulers with one
queue per core.

3. Additional performance evaluations on a two-socket 
multicore AMD system and a 192-processor SGI 
Altix. These evaluations demonstrate the performance 
portability and scalability of our run-time 
implementations. 

This paper extends work originally presented in Olivier
et al. (2011). The remainder of the paper is organized as
follows: Section 2 provides relevant background information,
Section 3 describes existing task scheduler designs
and our hierarchical approach, Section 4 presents the
results of our experimental evaluation, and Section 5 discusses
related work. We conclude in Section 6 with some
final observations.

2 Background 

Broadly supported by both commercial and open source 
compilers, OpenMP allows incremental parallelization of 
serial programs for execution on shared memory parallel 
computers. Version 3.0 of the OpenMP specification for 
FORTRAN and C/C++ adds explicit task parallelism to 
complement its existing data parallel constructs (OpenMP 
Architecture Review Board, 2008; Ayguadé et al., 2009).
The OpenMP task construct generates a task from a statement
or structured block. Task synchronization is provided
by the taskwait construct, and the semantics of the 
OpenMP barrier construct have also been overloaded to 
require completion of all outstanding tasks. 
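As a minimal sketch of these constructs (not code from our benchmarks), a recursive array sum can be expressed with task and taskwait. The function names `sum_tasks` and `sum_parallel` are hypothetical, and the absence of a cut-off is a simplification made only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: recursive sum of a[lo..hi) using OpenMP 3.0 tasks.
 * Each half is summed by a child task; taskwait suspends the parent until
 * both of its children have completed. */
static long sum_tasks(const int *a, size_t lo, size_t hi)
{
    assert(lo <= hi);
    if (hi - lo <= 1)
        return (hi > lo) ? a[lo] : 0;

    size_t mid = lo + (hi - lo) / 2;
    long left = 0, right = 0;

    #pragma omp task shared(left) firstprivate(a, lo, mid)
    left = sum_tasks(a, lo, mid);

    #pragma omp task shared(right) firstprivate(a, mid, hi)
    right = sum_tasks(a, mid, hi);

    #pragma omp taskwait
    return left + right;
}

/* One thread creates the root of the task tree; the other threads of the
 * team pick up tasks. The barrier ending the single construct guarantees
 * that all outstanding tasks have finished. */
long sum_parallel(const int *a, size_t n)
{
    long total = 0;
    #pragma omp parallel
    #pragma omp single
    total = sum_tasks(a, 0, n);
    return total;
}
```

Compiled without OpenMP support, the pragmas are ignored and the code runs serially with the same result.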

Execution of OpenMP programs combines the efforts of
the compiler and an OpenMP run-time library. Intel and
GCC both have integrated OpenMP compiler and
run-time implementations. Using the ROSE compiler (Liao
et al., 2010), we have created an equivalent method to compile
and run OpenMP programs with the Qthreads (Wheeler
et al., 2008) library. The ROSE compiler is a source-to-source
translator that supports OpenMP 3.0 with a simple
compiler flag. In one compile step, it produces an
intermediate C++ file and invokes the GNU C++
compiler to compile that file with additional libraries to
produce an executable. ROSE performs syntactic and
semantic analysis on OpenMP directives, transforming
them into run-time library calls in the intermediate program.
The ROSE common OpenMP run-time library
(XOMP) maps the run-time calls to functions in the
Qthreads library.

2.1 Qthreads 

Qthreads (Wheeler et al., 2008) is a cross-platform general- 
purpose parallel run-time library designed to support 
lightweight threading and synchronization in a flexible 
integrated locality framework. Qthreads directly supports 
programming with lightweight threads and a variety of synchronization
methods, including non-blocking atomic
operations and potentially blocking full/empty bit (FEB)
operations like those developed for the HEP machine
(Smith, 1981). The Qthreads lightweight threading concept
and its implementation are intended to match future hardware
environments by providing efficient software support
for massive multithreading. 

In the Qthreads execution model, lightweight threads 
(qthreads) are created in user-space with a small context 
and small fixed-size stack. Unlike heavyweight threads 
such as pthreads, qthreads do not support expensive features
like per-thread identifiers, per-thread signal vectors,
or preemptive multitasking. Qthreads are scheduled onto 
a small set of worker pthreads. Logically, a qthread is the 
smallest schedulable unit of work, such as a set of loop 
iterations or an OpenMP task, and execution of a program 
generates many more qthreads than it has worker pthreads. 
Each worker pthread is pinned to a processor core and 
assigned to a locality domain, termed a shepherd. Whereas 
Qthreads previously allowed only one worker pthread per 
shepherd, we added support for multiple worker pthreads 
per shepherd. This support enables us to map shepherds 
to different architectural components, e.g. one shepherd per 
core, one shepherd per shared L3 cache, or one shepherd 
per processor socket. 
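The mapping of workers to shepherds can be illustrated with a small sketch. It assumes, purely for illustration, that the cores sharing a locality domain are numbered consecutively, which is not guaranteed on real systems; the helper names `shepherd_of_core` and `shepherd_count` are hypothetical and are not part of the Qthreads API.

```c
#include <assert.h>

/* Hypothetical helper: index of the shepherd that owns the worker pinned to
 * core `core`, when each shepherd spans `cores_per_shepherd` consecutively
 * numbered cores (an assumption of this sketch). */
int shepherd_of_core(int core, int cores_per_shepherd)
{
    assert(core >= 0 && cores_per_shepherd > 0);
    return core / cores_per_shepherd;
}

/* Number of shepherds needed to cover `ncores` cores (rounds up). */
int shepherd_count(int ncores, int cores_per_shepherd)
{
    assert(ncores > 0 && cores_per_shepherd > 0);
    return (ncores + cores_per_shepherd - 1) / cores_per_shepherd;
}
```

With 32 cores, `cores_per_shepherd = 1` yields one shepherd per core (32 shepherds), while `cores_per_shepherd = 8` yields one shepherd per eight-core chip (4 shepherds).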

The default scheduler in the Qthreads run-time uses a
cooperative multitasking approach. When qthreads block,
e.g. when performing an FEB operation, a context switch is triggered.
Because this context switch is done in user-space via
function calls and requires neither signals nor saving a full
set of registers, it is less expensive than an operating system
or interrupt-based context switch. This technique allows
qthreads to execute uninterrupted until data is needed that
is not yet available, and allows the scheduler to attempt
to hide communication latency by switching to other
qthreads. This only hides communication latencies
that take longer than a context switch.

The Qthreads API includes several threaded loop interfaces,
built on top of the core threading components. The
API provides three basic parallel loop behaviors: one to
create a separate qthread for each iteration, one that divides
the iteration space evenly among all shepherds, and one
that uses a queue-like structure to distribute sub-ranges of
the iteration space to enable self-scheduled loops. We used
the Qthreads queueing implementation as a starting point
for our scheduling work.

We added support for the ROSE XOMP calls to
Qthreads, allowing it to be used as the run-time for OpenMP
programs. Although Qthreads XOMP/OpenMP support is
not fully complete, it accepts every OpenMP program
accepted by ROSE. We implement OpenMP threads as
worker pthreads. Unlike many OpenMP implementations,
default loop scheduling is self-guided rather than static,
though the latter can be explicitly requested. For task parallelism,
we implement each OpenMP task as a qthread.
(We use the term task rather than qthread throughout the
remainder of the paper, both for simplicity and because the
scheduling concepts we explore are applicable to other
task-parallel languages and libraries.) We used the
Qthreads FEB synchronization mechanism as a base layer
upon which to implement taskwait and barrier
synchronization.

3 Task scheduler design 

The stock Qthreads scheduler, called Q in Section 4, was
engineered for parallel loop computation. Each processor
executes chunks of loop iterations packaged as qthreads.
Round-robin distribution of the iterations among the shepherds
and self-scheduling are used in combination to maintain
load balance. A simple lock-free per-shepherd FIFO
queue stores iterations as they wait to be executed.

Task-parallel programs generate a dynamically unfolding
sequence of interdependent tasks, often represented
by a directed acyclic graph (DAG). A task executing on the
same thread as its parent or sibling tasks may benefit from
temporal locality if they operate on the same data. In particular,
such locality properties are a feature of divide-and-conquer
algorithms. To efficiently schedule tasks as
lightweight threads in Qthreads, the run-time must support
more general dynamic load balancing while exploiting
available locality among tasks. We implemented a modified
Qthreads scheduler, L, to use LIFO rather than FIFO
queues at each shepherd to improve the use of locality.
However, the round-robin distribution of tasks between
shepherds does not provide fully dynamic load balancing.

3.1 Work-stealing and centralized schedulers

To better meet the dual goals of locality and load balance,
we implemented work stealing. Blumofe et al. proved that
work stealing is optimal for multithreaded scheduling of
DAGs with minimal overhead costs (Blumofe and Leiserson,
1994), and they implemented it in their Cilk run-time
scheduler (Blumofe et al., 1995). Our initial implementation
of work stealing in Qthreads, WS, mimics Cilk’s scheduling
discipline: each shepherd schedules tasks depth-first
locally through LIFO queue operations. An idle shepherd
obtains more work by stealing the oldest tasks from the task
queue of a busy shepherd. We implemented two different
probing schemes to find a victim shepherd, observing
equivalent performance: choosing randomly and commencing
search at the nearest shepherd ID to the thief. In the
work-stealing scheduler, interruptions to busy shepherds
are minimized because the burden of load balancing is
placed on the idle shepherds. Locality is preserved because
newer tasks, whose data is still hot in the processor’s cache,
are the first to be scheduled locally and the last in line to be
stolen.
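The queue discipline described above can be sketched as a double-ended queue in which the owner works at the newest end and thieves take from the oldest end. The following is an illustration only, not the Qthreads implementation: the type `task_deque`, its function names, the fixed capacity, and the use of integer task IDs are all assumptions, and synchronization is deliberately omitted.

```c
#include <assert.h>
#include <stddef.h>

#define DEQUE_CAP 64            /* illustrative fixed capacity */

/* Hypothetical task deque for one shepherd. Integer IDs stand in for tasks.
 * Synchronization is omitted: a real scheduler must protect both ends,
 * for example with a lock. */
typedef struct {
    int    items[DEQUE_CAP];
    size_t head;                /* index of the oldest task */
    size_t tail;                /* one past the newest task */
} task_deque;

void deque_init(task_deque *d) { d->head = d->tail = 0; }

/* Owner: enqueue a newly created task at the newest end. */
void deque_push(task_deque *d, int task)
{
    if (d->head == d->tail)     /* empty: reuse the buffer from the start */
        d->head = d->tail = 0;
    assert(d->tail < DEQUE_CAP);
    d->items[d->tail++] = task;
}

/* Owner: run the newest task next (LIFO, depth-first). Returns 0 if empty. */
int deque_pop_newest(task_deque *d, int *task)
{
    if (d->head == d->tail) return 0;
    *task = d->items[--d->tail];
    return 1;
}

/* Thief: take the oldest task from the opposite end. Returns 0 if empty. */
int deque_steal_oldest(task_deque *d, int *task)
{
    if (d->head == d->tail) return 0;
    *task = d->items[d->head++];
    return 1;
}
```

After pushing tasks 1, 2, and 3, a thief obtains task 1 while the owner continues with task 3, so the most recently created work stays on the processor whose cache holds its data.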

The cost of work-stealing operations on multi-socket
multicore systems varies significantly based on the relative
locations of the thief and victim, e.g. whether they are
running on cores on the same chip or on different chips.
Stealing between cores on different chips reduces performance
by incurring higher overhead costs, additional cold
cache misses, remote locking, remote memory access costs,
and coherence misses due to false sharing. Another limitation
of work stealing is that it does not make the best possible
use of caches shared among cores. In contrast, Chen
et al. (2007) showed that a depth-first schedule close to
serial order makes better use of a shared cache than work
stealing, assuming serial execution of an application makes
good use of the cache. Blelloch et al. had shown that such a
schedule can be achieved using a shared LIFO queue
(Blelloch et al., 1999). We implemented a centralized
shared LIFO queue, CQ, for Qthreads, but it is a poor match
for multi-socket multicore systems since not all cores, but
only cores on the same chip, share the same cache. Moreover,
the centralized queue implementation is not scalable,
as contention drives up the overhead costs.

3.2 Hierarchical scheduling 

To overcome the limitations of both work stealing and
shared queues, we developed a hierarchical approach:
multithreaded shepherds, MTS. We create one shepherd for all
the cores on the same chip. These cores share a cache, typically
L3, and all are proximal to a local memory attached to
that socket. Within each shepherd, we map one pthread
worker to each core. Among workers in each shepherd, a
shared LIFO queue provides depth-first scheduling close
to serial order to exploit the shared cache. Thus, load balancing
happens naturally among the workers on a chip and
concurrent tasks have possible overlapping localities that
can be captured in the shared cache.

Between shepherds, work stealing is used to maintain 
load balance. Each time the shepherd’s task queue becomes 
empty, only the first worker to find the queue empty sets a 
flag and commences stealing. The other workers in the 
shepherd spin on cached copies of the flag until the steal 
is complete and the stealing thread resets the flag. The thief 
thread steals enough tasks from another shepherd’s queue 
to supply the workers in its shepherd with work. The 
number of tasks stolen per steal is a tunable parameter, but 
stealing one per worker in the shepherd ensures that at least

Alignment-single:

    #pragma omp single
    for (si = 0; si < nseqs; si++)
        for (sj = si + 1; sj < nseqs; sj++)
            #pragma omp task firstprivate(si, sj)
            compare(seq[si], seq[sj]);

Alignment-for:

    #pragma omp for schedule(dynamic)
    for (si = 0; si < nseqs; si++)
        for (sj = si + 1; sj < nseqs; sj++)
            #pragma omp task firstprivate(si, sj)
            compare(seq[si], seq[sj]);

Figure 1. Simplified code for the two versions of Alignment: single (first listing) and for (second listing).


immediately following the steal all threads have a task to 
execute. In practice, we have observed this heuristic to be 
effective, and Section 4.4 shows how performance varies 
for different choices of this parameter. If fewer tasks are 
available then the thief steals all the available tasks on the 
victim’s queue. The stolen tasks are dequeued and collected 
into a small linked list, then enqueued at the thief’s queue.
If the steal attempt fails because no tasks are available, then 
the thief thread selects a new victim and begins another 
steal attempt. 

Centralized task queueing for workers within each
shepherd reduces the need for remote stealing by providing
local load balance. By allowing only one representative
worker to steal at a time, in bulk for all workers in the
shepherd, communication overheads are reduced. While
a shared queue can be a performance bottleneck, the number
of cores per chip is bounded, and intra-chip locking
operations are fast.
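The two mechanisms of this section, a per-shepherd steal flag and a bulk steal of the victim's oldest tasks, can be sketched as follows. This is an illustration under stated assumptions rather than the Qthreads code: the type `shepherd_queue` and all function names are hypothetical, integer IDs stand in for tasks, locking of the task array is omitted, and tasks are copied directly instead of being collected into an intermediate linked list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 64            /* illustrative fixed capacity */

/* Hypothetical queue shared by all workers of one shepherd (one chip).
 * Locking of the task array is omitted to keep the sketch short. */
typedef struct {
    int         tasks[QUEUE_CAP];
    size_t      head, tail;     /* oldest at head, newest at tail - 1 */
    atomic_flag stealing;       /* set while one worker steals for the chip */
} shepherd_queue;

void shepherd_init(shepherd_queue *q)
{
    q->head = q->tail = 0;
    atomic_flag_clear(&q->stealing);
}

/* Returns true for exactly one caller until shepherd_end_steal() is called;
 * the other workers of the shepherd would wait for the flag to clear. */
bool shepherd_begin_steal(shepherd_queue *q)
{
    return !atomic_flag_test_and_set(&q->stealing);
}

void shepherd_end_steal(shepherd_queue *q)
{
    atomic_flag_clear(&q->stealing);
}

/* Move up to `chunk` of the victim's oldest tasks to the thief's queue.
 * If fewer are available, all of them are taken. Returns the number of
 * tasks moved; 0 means the thief should select another victim. */
size_t steal_chunk(shepherd_queue *victim, shepherd_queue *thief, size_t chunk)
{
    size_t avail = victim->tail - victim->head;
    size_t n = avail < chunk ? avail : chunk;
    assert(thief->tail + n <= QUEUE_CAP);
    for (size_t i = 0; i < n; i++)
        thief->tasks[thief->tail++] = victim->tasks[victim->head++];
    return n;
}
```

Setting `chunk` to the number of workers in the thief's shepherd corresponds to the heuristic of stealing one task per worker.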

4 Evaluation 

To evaluate the performance of our hierarchical scheduler
and the other Qthreads schedulers, we present results from
the Barcelona OpenMP Tasks Suite (BOTS), version 1.1,
available online (Duran and Teruel, 2010). The suite comprises
a set of task-parallel applications from various
domains with varying computational characteristics (Duran
et al., 2009). Our experiments used the following benchmark
components and inputs:

• Alignment: Aligns sequences of proteins using dynamic
programming (100 sequences);

• Fib: Computes the nth Fibonacci number using brute-force
recursion (n = 50);

• Health: Simulates a national health care system over a
series of timesteps (144 cities);

• NQueens: Finds solutions of the n-queens problem
using backtrack search (n = 14);

• Sort: Sorts a vector using parallel mergesort with
sequential quicksort and insertion sort (128M integers);

• SparseLU: Computes the LU factorization of a sparse
matrix (10000 x 10000 matrix, 100 x 100 submatrix
blocks);

• Strassen: Computes a dense matrix multiply using
Strassen’s method (8192 x 8192 matrix).

For the Fib, Health, and NQueens benchmarks, the
default manual cut-off configurations provided in BOTS
are enabled to prune the generation of tasks below a
prescribed point in the task hierarchy. For Sort, cut-offs are
set to transition at 32K integers from parallel mergesort to
sequential quicksort and from parallel merge tasks to
sequential merge calls. For Strassen, the cut-off giving the
best performance for each implementation is used. Other
BOTS benchmarks are not presented here: UTS and FFT
use very fine-grained tasks without cut-offs, yielding poor
performance on all run times, and floorplan raises compilation
issues in ROSE.
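A manual cut-off of this kind can be sketched for a recursive Fibonacci computation. The code below is a hypothetical illustration, not the BOTS source: the function names and the choice of recursion depth as the cut-off criterion are assumptions.

```c
#include <assert.h>

/* Serial version used below the cut-off; creates no tasks. */
static long fib_serial(int n)
{
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

/* Hypothetical manual cut-off: tasks are created only in the top `cutoff`
 * levels of the recursion; deeper calls run serially. */
static long fib_task(int n, int depth, int cutoff)
{
    if (n < 2) return n;
    if (depth >= cutoff) return fib_serial(n);

    long x = 0, y = 0;
    #pragma omp task shared(x) firstprivate(n, depth, cutoff)
    x = fib_task(n - 1, depth + 1, cutoff);

    #pragma omp task shared(y) firstprivate(n, depth, cutoff)
    y = fib_task(n - 2, depth + 1, cutoff);

    #pragma omp taskwait
    return x + y;
}

long fib_parallel(int n, int cutoff)
{
    long result = 0;
    assert(n >= 0 && cutoff >= 0);
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n, 0, cutoff);
    return result;
}
```

A larger `cutoff` creates more, smaller tasks; `cutoff = 0` disables task creation entirely, so the parameter trades available parallelism against per-task overhead.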

For both the Alignment and SparseLU benchmarks,
BOTS provides two different source files. Simplified code
given in Figure 1 illustrates the distinction between the two
versions of Alignment. In the first (Alignment-single) the
loop nest that generates the tasks is executed sequentially
by a single thread. This version creates only task parallelism.
In the second (Alignment-for) the outer loop is executed
in parallel, creating both loop-level parallelism and
task parallelism. Likewise, the two versions of SparseLU
are one in which tasks are generated within single-threaded
loop executions and another in which tasks are
generated within parallel loop executions.

We ran the battery of tests on seven scheduler implementations:
five versions of Qthreads (all compiled with
GCC 4.4.4 -O2), the GNU GCC OpenMP implementation
(Free Software Foundation Inc., 2010), and the Intel ICC
OpenMP implementation, as summarized in Table 1. The
Qthreads implementations are as follows:

• Q is the original version of Qthreads and defines each 
core to be a separate locality domain or shepherd. It 
uses a non-blocking FIFO queue to schedule tasks 
within each shepherd (individual core). Each shepherd 
only obtains tasks from its local queue, although tasks 
are distributed across shepherds on a round-robin basis 
for load balance. 

• L incorporates a simple double-ended locking LIFO 
queue to replace the original non-blocking FIFO queue. 
Concurrent access at both ends is required for work 
stealing, though L retains round-robin task distribution 
for load balance rather than work stealing. 

• CQ uses a single shepherd and centralized shared queue
to distribute tasks among all of the cores in the system.
This should provide adequate load balance, but contention
for the queue limits scalability as task size shrinks.

• WS provides a shepherd (and individual queue) for each 
core, and idle shepherds steal tasks from the shepherds 
running on the other cores. Initial task placement is not

Table 1. Scheduler implementations evaluated: five Qthreads implementations, ICC, and GCC.

Qthreads implementations, compiled ROSE/GCC -O2 -g

Version  Scheduler                Number of     Task         Internal             External
name     implementation           shepherds     placement    queue access         queue access
Q        Stock                    one per core  round-robin  FIFO (non-blocking)  none
L        LIFO                     one per core  round-robin  LIFO (blocking)      none
CQ       Centralized Queue        one           N/A          LIFO (blocking)      N/A
WS       Work-Stealing            one per core  local        LIFO (blocking)      FIFO stealing
MTS      MultiThreaded Shepherds  one per chip  local        LIFO (blocking)      FIFO stealing

ICC      Intel 11.1 OpenMP, compiled -O2 -xHost -ipo -g
GCC      GCC 4.4.4 OpenMP, compiled -O2 -g



Figure 2. Topology of the four-socket Intel Nehalem. 


round-robin between queues, but onto the local queue
of the shepherd where it is generated, exploiting locality
among related tasks.

• MTS assigns one shepherd to every processor memory 
locality (shared L3 cache on chip and attached 
DIMMs). Each core on a chip hosts a worker thread that 
shares its shepherd’s queue. Only one core is allowed to 
actively steal tasks on behalf of the queue at a time and 
tasks are stolen in chunks large enough (tunable) to 
keep all of the cores busy. 

4.1 Overall performance on Intel Nehalem 

The first hardware test system for our experiments is a Dell 
PowerEdge M910 quad-socket blade with four Intel x7550 
2.0 GHz 8-core Nehalem-EX processors installed for a total 
of 32 cores. The processors are fully connected using Intel 
QuickPath Interconnect (QPI) links, as shown in Figure 2. 
Each processor has an 18 MB shared L3 cache and each core 
has a private 256 KB L2 cache as well as 32 KB L1 data and
instruction caches. The blade has 64 dual-rank 2 GB DDR3
memory sticks (16 per processor chip) for a total of 132 GB.
It runs CentOS Linux with a 2.6.35 kernel. Although the
x7550 processor supports HyperThreading (Intel’s simultaneous
multithreading technology), we pinned only one
thread to each physical core for our experiments.

All executables using the Qthreads and GCC run-times
were compiled with GCC 4.4.4 with -g and -O2
optimization, for consistency. Executables using the Intel
run-time were compiled with ICC 11.1 and -O2 -xHost -ipo
optimization. Reported results are from the best of 10 runs.

Overall, the GCC and ICC compilers produce executables
with similar serial performance, as shown in Table 2.
These serial execution times provide a basis for us to compare
the relative speedup of the various benchmarks. If the
-ipo and -xHost flags are not used with ICC on SparseLU,
the GCC serial executable runs 3x faster than the ICC
executable compiled with -O2 alone. The significance of
this difference will be clearer in the presentation of parallel
performance on SparseLU in Section 4.2. Several other
benchmarks also run slower with those ICC flags omitted,
though not by such a large margin.

Qthreads MTS performance on 32 cores is faster than or
comparable to the performance of ICC and GCC. In absolute
execution time, MTS runs faster than ICC for five of
the seven benchmarks by up to 74.4%. It is over 6.6x faster
for one benchmark than GCC and up to 65.6% faster on
four of the six others. On two benchmarks MTS runs
slower: for Alignment it is 12.4% slower than ICC and
2.7% slower than GCC, and for Strassen it is 5.8% slower
than both (although WS equaled GCC’s performance; see
the discussion on Strassen in Section 4.2). Thus, even as a
research prototype, ROSE/Qthreads provides competitive
OpenMP task execution.
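The comparisons above can be reproduced from the execution times in Table 2. The sketch below assumes that "A is X% faster than B" means the ratio of B's time to A's time, minus one; this reading is inferred from the reported numbers and is not stated explicitly in the text. The function names are hypothetical.

```c
#include <assert.h>

/* Speedup of a parallel run relative to a serial run. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Percentage by which run A (time t_a) is faster than run B (time t_b),
 * under the assumed definition (t_b / t_a - 1) * 100. */
double percent_faster(double t_a, double t_b)
{
    assert(t_a > 0.0 && t_b > 0.0);
    return (t_b / t_a - 1.0) * 100.0;
}
```

For example, with the SparseLU times for MTS (4.530 s) and ICC (7.901 s), `percent_faster(4.530, 7.901)` is about 74.4, and with the Health times for GCC serial (15.31 s) and MTS (1.122 s), `speedup(15.31, 1.122)` is about 13.6.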

4.2 Individual performance on Intel Nehalem 

Individual benchmark performance on multiple implementations
of the OpenMP run-time demonstrates features of
particular applications where Qthreads generates better
scheduling and where it needs further development. Examining
where the run-times differ in achieved speedup
reveals the strengths and weaknesses of each scheduling
approach.

The Health benchmark, Figure 3, shows significant
diversity in performance and speedup. GNU performance
is slightly superlinear for four cores (4.5x), but peaks with
only 8 cores active (6.3x) and by 32 cores the speedup is
only 2x. Intel also has scaling issues and performance
flattens to 9x at 16 cores. Stock Qthreads Q scales slightly
better (9.4x), but just switching to the LIFO queue L to

Table 2. Sequential and parallel execution times using ICC, GCC, and the Qthreads MTS scheduler (time in seconds). For Alignment and
SparseLU, the best time between the two parallel variations (single and for) is shown.

Configuration               Alignment  Fib    Health  NQueens  Sort   SparseLU  Strassen
ICC -O2 -xHost -ipo Serial  28.33      100.4  15.07   49.35    20.14  117.3     169.3
GCC -O2 Serial              28.06      83.46  15.31   45.24    19.83  119.7     162.7
ICC 32 threads              0.9110     4.036  1.670   1.793    1.230  7.901     10.13
GCC 32 threads              0.9973     5.283  7.460   1.766    1.204  4.517     10.13
Qthreads MTS 32 workers     1.024      3.189  1.122   1.591    1.080  4.530     10.72


Figure 3. Health on four-socket Intel Nehalem. (Plot omitted; x-axis: Number of Threads; series: MTS, WS, CQ, L, Q, ICC, GCC.)

Figure 4. Sort on four-socket Intel Nehalem. (Plot omitted; x-axis: Number of Threads, 4 to 32.)

improve locality between tasks allows speedup on 32 cores
to reach 11.5x. Since the individual tasks are relatively
small, CQ experiences contention on its task queue that
limits speedup to 7.7x on 16 cores, with performance
degrading to 6.1x at 32 cores. When work stealing, WS,
is added to Qthreads the performance improves slightly and
speedup reaches 11.6x. MTS further improves locality and
load balance on each processor by sharing a queue across
the cores on each chip, and speedup increases to 13.6x
on 32 cores. This additional scalability gives Qthreads MTS
a 17.3% faster execution time on 32 cores than any other
implementation, much faster than ICC (48.7%) and
GCC (116.1%). Health provides an excellent example of
how both work stealing and queue sharing within a system
can independently and together improve performance,
though the failure of any run-time to reach 50%
efficiency on 32 cores shows that there is room for
improvement.
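To make the efficiency remark concrete, parallel efficiency can be taken as speedup divided by the number of cores. This standard definition is assumed here rather than stated in the text. Applied to the best Health result, the MTS speedup of 13.6x on 32 cores:

```latex
\[
E = \frac{S}{p}, \qquad
E_{\mathrm{MTS}} = \frac{13.6}{32} \approx 0.425 = 42.5\%
\]
```

Here S is the speedup and p the number of cores; reaching 50% efficiency on 32 cores would require a speedup of 16x.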


Figure 5. NQueens on four-socket Intel Nehalem. (Plot omitted; x-axis: Number of Threads; series: MTS, WS, CQ, L, Q, ICC, GCC.)

The benefits of hierarchical scheduling can also be seen
in Figure 4. Sort, for which we used a manual cutoff of 32K
integers to switch between parallel and serial sorts,
achieved speedup of about 16x for 32 cores on ICC and
GCC, but just 11.4x for the base version of Qthreads, Q.
The switch to a LIFO queue, L, improved speedup to
13.6x by facilitating data sharing between a parent and
child. Independent changes to add work stealing, WS, and
improve load balance, CQ, both improved speedup to
16x. By combining the best features of both work stealing
and multiple threads sharing a queue, MTS increased
speedup to 18.4x and achieved a 13.8% and 11.4% reduction
in overall execution time compared to ICC and GCC
OpenMP versions respectively.

Locality effects allow NQueens to achieve slightly
superlinear speedup for four and eight cores using
Qthreads. As seen in Figure 5, speedup is near-linear for
16 threads and only somewhat sublinear for 32 threads on
all OpenMP implementations. By adding load balancing
mechanisms to Qthreads, its speedup improved significantly
(24.3x to 28.4x). CQ and WS both improved load
balance beyond what the LIFO queue (L) provides and little
is gained by combining them together in MTS. The additional
scaling of these three versions results in an execution
time 12.6% faster than ICC and 10.9% faster than GCC.

Fib, Figure 6, uses a cut-off to stop the creation of very
small tasks, and thus has enough work in each task to
amortize the costs of queue access. CQ yields performance
2-3% faster than MTS and the other versions of Qthreads,
since load balance is good and no time is spent looking
for work. The load balancing versions of Qthreads
(26.1x-26.7x) scale better than Intel at 24.9x. Both

Figure 6. Fib on four-socket Intel Nehalem. (Plot omitted; x-axis: Number of Threads, 4 to 32.)

Figure 7. Alignment-single on four-socket Intel Nehalem. (Plot omitted; x-axis: Number of Threads, 4 to 32.)

Figure 8. Alignment-for on four-socket Intel Nehalem.


systems substantially outperform GCC, which reaches only 15.8x. Overall, the
scheduling improvements resulted in MTS running 26.5%
faster than ICC and 28.8% faster than GCC but 2.0%
slower than CQ.

The next two applications, Alignment and SparseLU,
each have two versions. For Alignment, Figures 7 and 8,
speedup was near-linear for all versions, and execution
times between GCC and Qthreads were close (GCC
+2.7% single initial task version; Qthreads +0.5% parallel
loop version). ICC scales better than GCC or Qthreads
MTS, WS, CQ, with 12.4% lower execution time. Since
Alignment has no taskwait synchronizations, we speculate
that ICC scales better on this benchmark because it


Figure 9. SparseLU-single on four-socket Intel Nehalem. (Plot omitted; x-axis: Number of Threads, 4 to 32.)

Figure 10. SparseLU-for on four-socket Intel Nehalem. (Plot omitted; series: MTS, WS, CQ, L, Q, ICC.)

maintains fewer bookkeeping data structures in the absence 
of synchronization. 

On both SparseLU versions, ICC serial performance
improved nearly 3x using the -ipo and -xHost flags rather
than using -O2 alone. The flags also improved parallel performance,
but by only 60%, so the improvement does not
scale linearly. On SparseLU-single, Figure 9, the performance
of GCC and the various Qthreads versions is effectively
equivalent, with speedup reaching 26.2x. Due to the
aforementioned scaling issues, ICC speedup reaches only
14.8x. The execution times differ by 0.3% between GCC
and MTS, with both about 74.4% faster than ICC. On
SparseLU-for, Figure 10, the GCC OpenMP runs were stopped
after exceeding the sequential time; thus data is not
reported. ICC again scales poorly (14.8x), and Qthreads
speedup improves due to the LIFO work queue and work
stealing, reaching 22.2x. MTS execution time is 46.3%
faster than ICC.

Strassen, Figure 11, performs recursive matrix multiplication
using Strassen’s method and is challenging for
implementations with multiple workers accessing a queue.
We used the cut-off setting that gave the best performance
for each implementation: coarser (128) for CQ and MTS
and the default setting (64) for the others. The execution
times of GCC and WS are within 1% of each other on 32
cores, and Intel scales slightly better (16.7x vs 16.1x). For
MTS, in which only 8 threads share a queue (rather than 32
as in CQ) the speedup reaches 15.2x. For CQ, however, the

Figure 11. Strassen on four-socket Intel Nehalem.

performance hit due to queue contention is substantial, as 
speedup peaks at 9.7 x. Q performance suffers from the 
FIFO ordering: not enough parallel work is expressed at 
any one time, and speedup never exceeds 4 x. 
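
The effect of the cut-off on task granularity can be sketched by counting leaf tasks. The following is a minimal illustration, not the BOTS implementation: it assumes that each recursion level spawns seven half-size sub-multiplications (as in Strassen’s method) and that recursion stops once the matrix dimension is at or below the cut-off. The function name `leaf_tasks` and the matrix dimension 4096 are hypothetical choices made for illustration.

```python
def leaf_tasks(n: int, cutoff: int) -> int:
    """Count leaf multiplications for an n x n Strassen-style recursion.

    Assumption: each level spawns 7 sub-problems of size n/2 and the
    recursion stops when n <= cutoff.
    """
    if n <= cutoff:
        return 1
    return 7 * leaf_tasks(n // 2, cutoff)


if __name__ == "__main__":
    n = 4096  # hypothetical matrix dimension, not taken from the paper
    for cutoff in (64, 128):
        print(f"cutoff={cutoff}: {leaf_tasks(n, cutoff)} leaf tasks")
```

Under these assumptions, doubling the cut-off removes one level of recursion and so reduces the number of leaf tasks by a factor of seven, which lowers the rate of queue operations for schedulers whose workers share a queue.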

4.3 Variability 

One interesting feature of a work-stealing run-time is an
idle thread’s ability to search for work and the effect this
has on performance in regions of limited parallelism or load
imbalance. Table 3 gives the standard deviation of 10 runs
as a percentage of the fastest time for each configuration
tested with 32 threads. Both Qthreads implementations
with work stealing (WS and MTS) have very small variation
in execution time for three of the nine programs. For eight
of the nine benchmarks, both WS and MTS show less
variability than ICC.
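
The variability metric used in Table 3 can be written down directly. The sketch below assumes the sample standard deviation, since the text does not say which estimator was used; the function name `variability_percent` and the run times in the example are hypothetical.

```python
import statistics


def variability_percent(times):
    """Standard deviation of run times as a percentage of the fastest run.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    return 100.0 * statistics.stdev(times) / min(times)


if __name__ == "__main__":
    # Hypothetical execution times (seconds) for 10 runs of one configuration.
    runs = [12.1, 12.0, 12.3, 12.2, 12.0, 12.4, 12.1, 12.2, 12.0, 12.1]
    print(f"{variability_percent(runs):.2f}% of the fastest time")
```

Normalizing by the fastest time rather than the mean makes the figure comparable across configurations whose best-case times differ widely.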

In three cases (Alignment-single, Health,
SparseLU-single), Qthreads WS variability was much lower than
MTS. Since MTS enables only one worker thread per
shepherd at a time to steal a chunk of tasks, it is reasonable to
expect this granularity to be reflected in execution time
variations. Overall, we see less variability with WS than
MTS in six of the nine benchmarks. We speculate that
normally having all the threads looking for work leads to
finding the last work quickest and therefore less variation
in total execution time. However, for some programs
(Alignment-for, SparseLU-for, Strassen), stealing
multiple tasks and moving them to an idle shepherd results in
faster execution during periods of limited parallelism.
WS also shows less variability than GCC in six of the eight
programs for which we have data. There is no data for
SparseLU-for on GCC, as explained in the previous
section.

4.4 Performance analysis of MTS 

Limiting the number of inter-chip load balancing
operations is central to the design of our hierarchical scheduler
(MTS). Consider the number of remote (off-chip) steal
operations performed by MTS and by the flat work-stealing
scheduler WS, shown in Table 5. These counts
exclude the number of on-chip steals performed by WS, and
recall that MTS uses work stealing only between chips. We
observe that WS steals more than MTS in almost all cases,
and in some cases by an order of magnitude. Health and
Sort are two benchmarks where MTS wins clearly in terms
of speedup. WS performs roughly 17 times as many remote
steals as MTS on Sort and roughly 10 times as many on
Health (Table 5). The number of failed steals is also significantly
higher with WS than with MTS. A failed steal occurs when
a thief’s lock-free probe of a victim indicates that work is
available but, upon acquisition of the lock to the victim’s
queue, the thief finds no work to steal because another
thread has stolen it or the victim has executed the tasks
itself. Thus, both failed and completed steals contribute
to overhead costs.
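
The probe-then-lock sequence that produces a failed steal can be sketched as follows. This is a simplified illustration, not the Qthreads source: the names `VictimQueue`, `StealStats`, and `complete_steal` are hypothetical, and an unsynchronized length check stands in for the lock-free probe.

```python
import threading
from collections import deque


class VictimQueue:
    """Task queue that can be probed without taking its lock."""

    def __init__(self, tasks=()):
        self._tasks = deque(tasks)
        self._lock = threading.Lock()

    def probe(self) -> bool:
        # Unsynchronized peek: the answer may be stale once the lock is held.
        return len(self._tasks) > 0

    def take_locked(self):
        # Re-check under the lock; return None if the work is already gone.
        with self._lock:
            return self._tasks.popleft() if self._tasks else None


class StealStats:
    """Per-thief counters for completed and failed steals."""

    def __init__(self):
        self.steals = 0
        self.failed = 0


def complete_steal(stats: StealStats, victim: VictimQueue):
    """Call only after a positive probe; a miss here is a failed steal."""
    task = victim.take_locked()
    if task is None:
        stats.failed += 1
    else:
        stats.steals += 1
    return task


if __name__ == "__main__":
    victim = VictimQueue(["task-0"])
    a, b = StealStats(), StealStats()
    # Both thieves probe before either one acquires the lock.
    a_saw, b_saw = victim.probe(), victim.probe()
    if a_saw:
        complete_steal(a, victim)   # takes the only task
    if b_saw:
        complete_steal(b, victim)   # finds the queue empty: failed steal
    print(a.steals, a.failed, b.steals, b.failed)
```

A thief whose probe finds no work never takes the lock, so in this sketch it is not counted as a failed steal; only the lock acquisition that ends empty-handed is.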

The MTS scheduler aggregates inter-chip load balancing
by permitting only one worker at a time to initiate bulk
stealing from remote shepherds. Figure 12 shows how this
improves performance on Health, one of the benchmarks
sensitive to load balancing granularity. If only one task is
stolen at a time, subsequent steals are needed to provide all
workers with tasks, adding to overhead costs. There are
eight cores per socket on our test machine, thus eight
workers per shepherd, and a target of eight tasks stolen per steal
request. This coincides with the peak performance: when
the target number of tasks stolen corresponds to the number
of workers in the shepherd, all workers in the shepherd are
able to draw work immediately from the queue as a result
of the steal.
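
The relationship between chunk size and the number of steal operations can be illustrated with a short sketch. The helper names `bulk_steal` and `steals_to_feed` are hypothetical, and taking tasks from the oldest end of the victim's queue is an assumption made for the example rather than a detail stated here.

```python
import math
from collections import deque


def bulk_steal(victim: deque, target: int) -> list:
    """Remove up to `target` tasks from the old (left) end of `victim`.

    Fewer tasks are returned when the victim holds fewer than `target`.
    """
    count = min(target, len(victim))
    return [victim.popleft() for _ in range(count)]


def steals_to_feed(workers: int, chunk: int) -> int:
    """Steal operations needed to hand one task to each idle worker."""
    return math.ceil(workers / chunk)


if __name__ == "__main__":
    for chunk in (1, 2, 4, 8):
        print(f"chunk={chunk}: {steals_to_feed(8, chunk)} steals for 8 workers")
    print(bulk_steal(deque(range(20)), 8))  # a full chunk of eight tasks
    print(bulk_steal(deque(range(3)), 8))   # scarce work: only three tasks
```

With eight workers per shepherd, a chunk of eight supplies every worker from a single remote steal, whereas a chunk of one requires eight separate steals.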

Frequently, the number of tasks available to steal is
less than the target number to be stolen. Table 4 shows the
total number of tasks stolen and the average number of
tasks stolen per steal operation. Across all benchmarks,
the range of tasks stolen per steal is 2.8 to 6.0. The
numbers skew downward due to a scarcity of work during
start-up and near termination, when only one or few tasks
are available at a time. Note the lower number both of
total steals and tasks per steal for the for versions of
Alignment and SparseLU compared to the single versions.
Loop parallel initialization provides good initial load
balance so that fewer steals are needed, and those that do
occur sporadically are near termination and
synchronization phases.
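
As a consistency check, the per-steal averages can be recomputed from the reported totals. The sketch below divides the MTS task counts of Table 4 by the MTS remote steal counts of Table 5; treating the per-steal average as this quotient is an assumption, and small differences are expected because both tables report averages over 10 runs.

```python
# benchmark: (tasks stolen [Table 4], MTS steals [Table 5], reported tasks per steal)
MTS_DATA = {
    "Alignment (single)": (5900, 1016, 5.8),
    "Alignment (for)": (450, 109, 4.1),
    "Fib": (2181, 633, 3.4),
    "Health": (159386, 28948, 5.5),
    "NQueens": (423, 102, 4.1),
    "Sort": (5214, 1134, 4.6),
    "SparseLU (single)": (93117, 18045, 5.1),
    "SparseLU (for)": (38198, 13486, 2.8),
    "Strassen": (1355, 227, 6.0),
}


def tasks_per_steal(tasks_stolen: int, steals: int) -> float:
    """Average number of tasks moved by one steal operation."""
    return tasks_stolen / steals


if __name__ == "__main__":
    for name, (tasks, steals, reported) in MTS_DATA.items():
        computed = tasks_per_steal(tasks, steals)
        print(f"{name:20s} computed {computed:4.2f}  reported {reported}")
```

Every computed value is below the target of eight tasks per steal, in line with the scarcity of work described above.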

Another benefit of the MTS scheduler is better L3 cache
performance, since all workers in a shepherd share the
on-chip L3 cache. The WS scheduler exhibits poorer cache
performance and, consequently, more reads to main memory.
Tables 6 and 7 show the relevant metrics for Health and
Sort as measured using hardware performance counters,
averaged over 10 runs. They also show more traffic on the
Quick Path Interconnect (QPI) between chips for WS than
for MTS. QPI traffic occurs when data is requested and
transferred from either remote memory or remote L3 cache,
i.e. attached to a different socket of the machine. Not only
are remote accesses higher latency, but they also result in
remote cache invalidations of shared cache lines and
subsequent coherence misses. Increased QPI traffic in WS
reflects more remote steals and more accesses to data in

Table 3. Variability in performance on four-socket Intel Nehalem using ICC, GCC, MTS, and WS schedulers (standard deviation as a
percentage of the fastest time).

Configuration              Alignment  Alignment  Fib   Health  NQueens  Sort   SparseLU  SparseLU  Strassen
                           (single)   (for)                                    (single)  (for)
ICC 32 threads             4.4        2.0        3.7   2.0     3.2      4.0    1.1       3.9       1.8
GCC 32 threads             0.11       0.34       2.8   0.35    0.77     1.8    0.49      N/A       1.4
Qthreads MTS 32 workers    0.28       1.5        3.3   1.3     0.78     1.9    0.15      0.16      1.9
Qthreads WS 32 shepherds   0.035      1.8        2.0   0.29    0.60     0.90   0.060     0.24      3.0

Table 4. Tasks stolen and tasks per steal using the MTS scheduler. Average of 10 runs.

                   Alignment  Alignment  Fib    Health  NQueens  Sort   SparseLU  SparseLU  Strassen
                   (single)   (for)                                     (single)  (for)
Tasks Stolen       5900       450        2181   159386  423      5214   93117     38198     1355
Tasks Per Steal    5.8        4.1        3.4    5.5     4.1      4.6    5.1       2.8       6.0


Table 5. Number of remote steal operations during execution by
Qthreads MTS and WS schedulers. In a failed steal, the thief
acquires the lock on the victim’s queue after a positive probe for
work but ultimately finds no work available for stealing. On-chip
steals performed by the WS scheduler are excluded. Average of
10 runs.

                      MTS               WS
Benchmark             Steals   Failed   Steals   Failed
Alignment (single)    1016     88       3695     255
Alignment (for)       109      122      1431     286
Fib                   633      331      467      984
Health                28948    10323    295637   47538
NQueens               102      141      1428     389
Sort                  1134     404      19330    3283
SparseLU (single)     18045    8133     68927    24506
SparseLU (for)        13486    11889    68099    32205
Strassen              227      157      14042    823



Figure 12. Performance on Health using MTS based on choice of 
the chunk size for stealing. Average of 10 runs on four-socket Intel 
Nehalem. 

remote L3 caches and remote memory. In summary, MTS
gains advantage by exploiting locality among tasks
executed by threads on cores of the same chip, making good
use of the shared L3 cache to access memory less
frequently and avoid high latency remote accesses and
coherence misses.


Table 6. Memory performance data for Health using MTS and
WS. Average of 10 runs on four-socket Intel Nehalem.

Metric               MTS        WS         %Diff
L3 Misses            1.16e+06   2.58e+06   38
Bytes from Memory    8.23e+09   9.21e+09   5.6
Bytes on QPI         2.63e+10   2.98e+10   6.2


Table 7. Memory performance data for Sort using MTS and WS.
Average of 10 runs on four-socket Intel Nehalem.

Metric               MTS        WS         %Diff
L3 Misses            1.03e+07   3.42e+07   54
Bytes from Memory    2.27e+10   2.53e+10   5.5
Bytes on QPI         4.35e+10   4.87e+10   5.6
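
The text does not define the %Diff column of Tables 6 and 7. One reading that reproduces a subset of the reported values is a difference normalized by the sum of the two measurements, (WS - MTS) / (WS + MTS). This is an inference from the numbers, not a definition given in the text, and the function name `normalized_diff_percent` is hypothetical. The inputs below are the rounded table entries, so agreement is only approximate.

```python
# (benchmark, metric, MTS, WS, reported %Diff) copied from Tables 6 and 7
ROWS = [
    ("Health", "Bytes from Memory", 8.23e9, 9.21e9, 5.6),
    ("Health", "Bytes on QPI", 2.63e10, 2.98e10, 6.2),
    ("Sort", "L3 Misses", 1.03e7, 3.42e7, 54.0),
    ("Sort", "Bytes from Memory", 2.27e10, 2.53e10, 5.5),
    ("Sort", "Bytes on QPI", 4.35e10, 4.87e10, 5.6),
]


def normalized_diff_percent(mts: float, ws: float) -> float:
    """Assumed definition: difference relative to the sum of both values."""
    return 100.0 * (ws - mts) / (ws + mts)


if __name__ == "__main__":
    for bench, metric, mts, ws, reported in ROWS:
        computed = normalized_diff_percent(mts, ws)
        print(f"{bench:7s} {metric:18s} computed {computed:5.1f}  reported {reported}")
```

Whatever the exact definition, every metric is higher for WS than for MTS on both benchmarks, and the gap is largest for L3 misses.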


4.5 Performance on AMD Magny Cours 

We also evaluate the Qthreads schedulers against ICC and
GCC on a 2-socket AMD Magny Cours system, one node
of a cluster at Sandia National Laboratories. Each socket
hosts an Opteron 6136 multi-chip module: two quad-core
chips that share a package and are connected via two internal
HyperTransport (HT) links. The remaining two HT links
per chip are connected to the chips in the other socket,
as shown in Figure 13. Each chip contains a memory
controller with 8 GB attached DDR3 memory, a 5 MB shared
L3 cache, and four 2.4 GHz cores with 64 KB L1 and 512
KB L2 caches. Thus, there are a total of 16 cores and 32
GB memory, evenly divided among four
HyperTransport-connected NUMA nodes (one per chip,
two per socket). The system runs Cray compute-node
Linux kernel 2.6.27, and we used GCC 4.6.0 with
-O3 optimization and ICC 12.0 with -O3 -ipo -msse3 -simd
optimization.
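
The machine totals quoted above follow from the per-chip figures. The sketch below records the topology in two small data classes and derives the totals; the class and attribute names (`Chip`, `Machine`, `l3_mb_per_core`, and so on) are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chip:
    cores: int = 4        # quad-core chip
    memory_gb: int = 8    # DDR3 attached to the chip's memory controller
    l3_mb: float = 5.0    # L3 cache shared by the cores of the chip


@dataclass(frozen=True)
class Machine:
    sockets: int = 2
    chips_per_socket: int = 2
    chip: Chip = field(default_factory=Chip)

    @property
    def numa_nodes(self) -> int:
        # One NUMA node per chip.
        return self.sockets * self.chips_per_socket

    @property
    def total_cores(self) -> int:
        return self.numa_nodes * self.chip.cores

    @property
    def total_memory_gb(self) -> int:
        return self.numa_nodes * self.chip.memory_gb

    @property
    def l3_mb_per_core(self) -> float:
        return self.chip.l3_mb / self.chip.cores


if __name__ == "__main__":
    m = Machine()
    print(m.numa_nodes, m.total_cores, m.total_memory_gb, m.l3_mb_per_core)
```

The derived L3 capacity per core is the quantity that matters for a scheduler designed around cache sharing within a chip.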

We ran the same benchmarks with the same parameters 
as the Intel Nehalem evaluation. Sequential execution 
times are reported in Table 8. Again, interprocedural 




Table 8. Sequential execution times using ICC and GCC on the AMD Magny Cours.

Configuration   Alignment  Fib     Health  NQueens  Sort    SparseLU  Strassen
ICC             23.93      107.9   10.18   60.56    18.51   156.0     214.9
GCC             29.77      105.0   10.67   58.16    17.72   153.4     211.1



Figure 13. Topology of the two-socket/four-chip AMD Magny 
Cours. 

optimization (-ipo) in ICC was essential to match the GCC 
performance; execution time was more than 500 seconds 
without it. The greatest remaining difference between the 
sequential times is for Alignment, where GCC is 20% 
slower than ICC. 

Speedup results using 16 threads are given in Figure 14,
for Qthreads configurations with one shepherd per chip,
MTS (4Q); one shepherd per socket, MTS (2Q); one
shepherd per core (flat work stealing), WS; ICC; and GCC. At
least one of the Qthreads variants matches or beats ICC and
GCC on all but one of the benchmarks. Moreover, the
Qthreads schedulers achieve near-linear to slightly
super-linear speedup on six of the nine benchmarks: the two
versions of Alignment, Fib, NQueens, and the two versions of
SparseLU. Of those, speedup using ICC is 22% and 23%
lower than Qthreads on the two versions of Alignment,
10% and 18% lower on the two versions of SparseLU,
9% lower on NQueens and 7% lower on Fib. GCC is
42% lower than Qthreads on Fib, 9% and 27% lower on the
two versions of SparseLU, and close on NQueens and both
versions of Alignment.

On three of the benchmarks, no run-time achieves ideal
speedup. Strassen is the only benchmark on which ICC and
GCC outperform Qthreads, and even ICC falls short of
10x. On Sort, the best performance is with Qthreads WS,
MTS (4Q), and GCC, all at roughly 8x. Speedup is lower
with Qthreads MTS (2Q) and still lower with CQ, indicating
that centralized queueing beyond the chip level is
counterproductive. ICC speedup lags behind the other schedulers
on this benchmark. Speedup on Health peaks at 3.3x on
this system using the Qthreads schedulers, with even worse
speedup using ICC and GCC.

The variability in execution times is shown in Table
9. The standard deviations for all of the benchmarks
on the MTS and WS Qthreads implementations are below
2% of the best-case execution time. On all but two
of the benchmarks, the MTS standard deviation is less
than 1%.

The Magny Cours results demonstrate that the
competitive, and in some cases superior, performance of our
Qthreads schedulers against ICC and GCC is not confined
to the Intel architecture. At first glance, differences in
performance using the various Qthreads configurations seem
less pronounced than they were on the four-socket Intel
machine. However, those differences were strongest on the
Intel machine at 32 threads, and the AMD system only has
16 threads. Some architectural differences go beyond the
difference in core count. MTS is designed to leverage
locality in shared L3 cache, but the Magny Cours has much less
L3 cache per core than the Intel system (1.25 MB/core
versus 2.25 MB/core). Less available cache also accounts
for worse performance on the data-intensive Sort and
Health benchmarks.


4.6 Performance on SGI Altix 

We evaluate scalability beyond 32 threads on an SGI Altix
3700. Each of the 96 nodes contains two 1.6 GHz Intel
Itanium2 processors and 4 GB of memory, for a total of
192 processors and 384 GB of memory. The nodes are
connected by the proprietary SGI NUMALink4 network
and run a single system image of SuSE Linux kernel
2.6.16. We used the GCC 4.5.2 compiler as the native
compiler for our ROSE-transformed code and the GCC OpenMP
run-time for comparison against Qthreads. The version of
ICC on the system is not recent enough to include support
for OpenMP tasks. Sequential execution times, given in
Table 10, are slower than those of the other machines,
because the Itanium2 is an older processor, runs at a lower
clock rate, and uses a different instruction set (ia64).

The best observed performance on any of the
benchmarks was on NQueens, shown in Figure 15. WS achieves
115x on 128 threads (90% parallel efficiency) and reaches
148x on 192 threads. MTS reaches 134x speedup. (On this
machine, the MTS configuration has two threads per
shepherd to match the two processors per NUMA node.)
CQ tops out at 77x speedup on 96 threads, beyond which
overheads from queue contention become overwhelming.
GCC gets up to only 40x speedup. Although no run-time
achieves linear speedup on the full machine, they all reach
30x to 32x speedup with 32 threads; this underlines the
importance of testing at higher processor counts to evaluate
scalability. On the Fib benchmark, shown in Figure 16,
MTS almost doubles the performance of CQ and GCC on
192 threads, with a maximum speedup of 97x. CQ peaks
at 68x speedup on 128 threads and WS exhibits its worst

Figure 14. BOTS benchmarks on 2-socket AMD Magny Cours using 16 threads.


Table 9. Variability in performance on AMD Magny Cours using 16 threads (standard deviation as a percentage of the fastest time).

Configuration       Alignment  Alignment  Fib   Health  NQueens  Sort   SparseLU  SparseLU  Strassen
                    (single)   (for)                                    (single)  (for)
ICC                 2.2        0.80       1.3   14      1.1      8.2    0.62      0.31      2.5
GCC                 0.035      0.27       5.4   0.38    0.96     3.5    0.016     0.025     1.1
Qthreads MTS (4Q)   0.25       0.63       1.5   0.17    0.13     1.1    0.012     0.16      0.98
Qthreads MTS (2Q)   0.46       0.68       1.4   0.069   0.24     0.30   0.015     0.081     0.87
Qthreads WS         0.21       1.3        1.5   0.15    0.13     1.8    0.036     0.094     1.4

Table 10. Sequential execution times on the SGI Altix.

Configuration   Alignment  Fib     Health  NQueens  Sort    SparseLU  Strassen
GCC             53.96      139.2   45.60   63.62    33.59   632.7     551.3



Figure 15. NQueens on SGI Altix. 


Figure 16. Fib on SGI Altix. 


performance relative to MTS, maxing out at 77x speedup
on 96 threads.
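
Parallel efficiency, quoted above for NQueens, is speedup divided by thread count. The short sketch below applies that definition to three of the reported NQueens data points; the function name `parallel_efficiency` is a hypothetical helper for this illustration.

```python
def parallel_efficiency(speedup: float, threads: int) -> float:
    """Fraction of ideal (linear) speedup achieved on `threads` threads."""
    return speedup / threads


if __name__ == "__main__":
    # (scheduler, speedup, threads) as reported for NQueens on the Altix
    points = [("WS", 115, 128), ("WS", 148, 192), ("CQ", 77, 96)]
    for name, speedup, threads in points:
        eff = 100 * parallel_efficiency(speedup, threads)
        print(f"{name}: {speedup}x on {threads} threads = {eff:.0f}% efficiency")
```

Efficiency falls as the thread count grows even while absolute speedup keeps rising, which is why results at 32 threads alone say little about scalability.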

We see better peak performance on Alignment-for
(Figure 18) than Alignment-single (Figure 17). WS
reaches 116x speedup on 192 threads and MTS reaches
107x, with CQ and GCC performing significantly worse.
On the other hand, SparseLU-single (Figure 19) scales
better than SparseLU-for (Figure 20). Peak speedup on
SparseLU-single is 89x with MTS and 86x with WS, while
SparseLU-for achieves a peak speedup of 60x. As was the
case on the four-socket Intel machine, GCC is unable to
complete this benchmark within a timeout equal to the
sequential execution time.

For three of the benchmarks, no improvement in
speedup was observed beyond 32 threads: Health, Sort,
and Strassen. As shown in Figure 21, none exceed 10x
speedup on the Altix. These were also observed to be the
most challenging on the four-socket Intel and two-socket
AMD systems. Health and Sort are the most
data-intensive and require new strategies to achieve
performance improvement, an important area of research going
forward.

Figure 17. Alignment-single on SGI Altix.

Figure 18. Alignment-for on SGI Altix.

Figure 19. SparseLU-single on SGI Altix.

5 Related work 

Many theoretical and practical issues of task-parallel
languages and their run-time implementations were explored
during the development of earlier task-parallel
programming models, both hardware supported, e.g. Tera MTA
(Alverson et al., 1992), and software supported, e.g. Cilk
(Blumofe et al., 1995; Frigo et al., 1998). Much of our
practical reasoning was influenced by experience with the Tera
MTA run-time, designed for massive multithreading and
low-overhead thread synchronization. Cilk scheduling uses
a work-first scheduling strategy coupled with a randomized
work-stealing load balancing strategy shown to be optimal
(Blumofe and Leiserson, 1994). Our use of shared queues is


Figure 20. SparseLU-for on SGI Altix.

Figure 21. Health, Sort, and Strassen on SGI Altix: 32 threads.


inspired by Parallel Depth-First Scheduling (PDFS)
(Blelloch et al., 1999), which attempts to maintain a
schedule close to serial execution order, and its constructive
cache sharing benefits (Chen et al., 2007).

The first prototype compiler and run-time for OpenMP 
3.0 tasks was an extension of Nanos Mercurium (Teruel 
et al., 2007). An evaluation of scheduling strategies for 
tasks using Nanos compared centralized breadth-first and 
fully-distributed depth-first work-stealing schedulers 
(Duran et al., 2008b). Later extensions to Nanos included 
internal dynamic cut-off methods to limit overhead costs 
by inlining tasks (Duran et al., 2008a). 

In addition to OpenMP 3.0, there are currently several
other task-parallel languages and libraries available to
developers: Microsoft Task Parallel Library (Leijen et al.,
2009) for Windows, Intel Threading Building Blocks (TBB)
(Kukanov and Voss, 2007), and Intel Cilk Plus (Intel Corp.,
2010) (formerly Cilk++). The task-parallel model and its
run-time support are also key components of the X10
(Charles et al., 2005) and Chapel (Chamberlain et al.,
2007) languages.

Hierarchical work stealing, i.e. stealing at all levels of a
hierarchical scheduler, has been implemented for clusters
and grids in Satin (van Nieuwpoort et al., 2000), ATLAS
(Baldeschwieler et al., 1996), and more recently in Kaapi
(Gautier et al., 2007; Quintin and Wagner, 2010). Those
libraries are not optimized for shared caches in multi-core
systems, which is the basis for the shared LIFO queue at the
lower level of our hierarchical scheduler. The ForestGOMP
run-time system (Broquedis et al., 2010) also uses work
stealing at both levels of its hierarchical scheduler, but like
our system targets NUMA shared memory systems. It
schedules OpenMP nested data parallelism by clustering related
threads (not tasks) into “bubbles,” scheduling them by
work stealing among cores on the same chip, and selecting
for work stealing between chips those threads with the
lowest amount of associated memory. Data is migrated
between sockets along with the stolen threads.

6 Conclusions and future work 

As multicore systems proliferate, the future of software 
development for supercomputing relies increasingly on 
high-level programming models such as OpenMP for
on-node parallelism. The recently added OpenMP constructs
for task parallelism raise the level of abstraction to improve 
programmer productivity. However, if the run-time cannot 
execute applications efficiently on the available multicore 
systems, the benefits will be lost. 

The complexity of multicore architectures grows with
each hardware generation. Today, even off-the-shelf server
chips have 6-12 cores and a chip-wide shared cache.
Tomorrow may bring 30+ cores and multiple caches that
service subsets of cores. Existing scheduling approaches
were developed based on a flat system model. Our
performance study revealed their strengths and limitations on a
current generation multi-socket multicore architecture and
demonstrated that mirroring the hierarchical nature of the
hardware in the run-time scheduler can indeed improve
performance. Qthreads (by way of ROSE) accepts a large
number of OpenMP 3.0 programs, and, using our MTS
scheduler, has performance as high or higher than the
commonly available OpenMP 3.0 implementations. Its
combination of shared LIFO queues and work stealing
maintains good load balance while supporting effective
cache performance and limiting overhead costs. On the
other hand, pure work stealing has been shown to provide
the least variability in performance, an important
consideration for distributed applications in which barriers cause
the application to run at the speed of the slowest worker,
e.g. in a Bulk Synchronous Processing (BSP) application
with task parallelism used in the computation phase.
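
The cost of variability under barriers can be made concrete with a small model. The sketch assumes that every superstep ends in a barrier, so each step takes as long as its slowest worker; the function name `bsp_time` and the per-worker times are hypothetical.

```python
def bsp_time(step_times):
    """Total time when a barrier ends every superstep.

    `step_times` holds one list of per-worker compute times per superstep.
    """
    return sum(max(workers) for workers in step_times)


if __name__ == "__main__":
    # Hypothetical per-worker times (seconds) for two supersteps.
    steady = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
    jittery = [[1.0, 1.0, 1.0, 1.3], [1.2, 1.0, 1.0, 1.0]]
    print(bsp_time(steady))   # every step costs the common time
    print(bsp_time(jittery))  # every step costs its slowest worker
```

In this model, one slow worker per step delays all of the others, so low run-to-run variability on each node translates directly into shorter supersteps.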

The scalability results on the SGI Altix are important
because previous BOTS evaluations (Duran et al., 2009;
Olivier et al., 2011) only presented results on up to 32 cores.
It is encouraging that several benchmarks reach speedups
of 90x-150x on 192 cores. The challenge on those
benchmarks is to close the performance gap between observed
speedup and ideal speedup through incremental reductions
in overhead costs and idle times and better exploitation of
locality. Other benchmarks fail to scale well even at 32
threads or fewer. On the data-intensive Sort and Health
benchmarks we have observed a sharp increase in
computation time due to increased load latencies compared to
sequential execution. To ameliorate that issue, we are
investigating programmer annotations to specify task
scheduling constraints that identify and maintain data locality.

One challenge posed by our hierarchical scheduling
strategy is the need for an efficient queue supporting
concurrent access on both ends, since workers within a
shepherd share a queue. Most existing lock-free queues for
work stealing, such as the Arora, Blumofe, and Plaxton
(ABP) queue (Arora et al., 1998) and resizable variants
(Chase and Lev, 2005; Hendler et al., 2006), allow only one
thread to execute push() and pop() operations. Lock-free
double-ended queues (deques) generalize the ABP queue to
allow for concurrent insertion and removal on both ends of
the queue. Lock-free deques have been implemented with
compare-and-swap atomic primitives (Michael, 2003;
Sundell and Tsigas, 2005), but speed is limited by their use of
linked lists. We are currently working to implement an
array-based lock-free deque, though even with a
lock-based queue we have achieved results competitive with and
in many cases better than ICC and GCC.
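
A lock-based queue with access at both ends can be sketched in a few lines. This is an illustrative stand-in, not the Qthreads data structure: the class name `SharedTaskQueue` and its method names are hypothetical, a single lock guards both ends, and having thieves remove from the oldest end is an assumption.

```python
import threading
from collections import deque


class SharedTaskQueue:
    """Lock-based queue shared by the workers of one shepherd (sketch)."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()

    def push(self, task):
        """Any worker of the shepherd may add a task at the new end."""
        with self._lock:
            self._items.append(task)

    def pop_newest(self):
        """LIFO removal by a local worker; returns None if empty."""
        with self._lock:
            return self._items.pop() if self._items else None

    def take_oldest(self, limit=1):
        """Remove up to `limit` tasks from the opposite end, e.g. for a thief."""
        with self._lock:
            count = min(limit, len(self._items))
            return [self._items.popleft() for _ in range(count)]


if __name__ == "__main__":
    q = SharedTaskQueue()
    for i in range(10):
        q.push(i)
    print(q.pop_newest())    # most recently pushed task
    print(q.take_oldest(4))  # the four oldest tasks
```

Because every operation takes the same lock, any number of workers can push and pop concurrently at the price of serialization, which is the contention that a lock-free array-based deque would aim to reduce.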

Funding 

This work is supported in part by a grant from the United 
States Department of Defense. Sandia is a multiprogram 
laboratory operated by Sandia Corporation, a Lockheed 
Martin Company, for the United States Department of 
Energy’s National Nuclear Security Administration under 
contract DE-AC04-94AL85000. 

Conflict of interest 

None declared. 

References 

Alverson GA, Alverson R, Callahan D, et al. (1992) Exploiting
heterogeneous parallelism on a multithreaded multiprocessor.
In: ICS’92: Proceedings of the 6th ACM International
Conference on Supercomputing, pp. 188-197.

Arora NS, Blumofe RD and Plaxton CG (1998) Thread scheduling
for multiprogrammed multiprocessors. In: SPAA ’98:
Proceedings of the 10th ACM Symposium on Parallel Algorithms and
Architectures, pp. 119-129.

Ayguade E, Copty N, Duran A, et al. (2009) The design of
OpenMP tasks. IEEE Transactions on Parallel and
Distributed Systems 20: 404-418.

Baldeschwieler JE, Blumofe RD and Brewer EA (1996) Atlas: an
infrastructure for global computing. In: EW 7: Proceedings of
the 7th ACM SIGOPS European Workshop, pp. 165-172.

Blelloch GE, Gibbons PB and Matias Y (1999) Provably efficient
scheduling for languages with fine-grained parallelism.
Journal of the ACM 46: 281-321.

Blumofe R, Joerg C, Kuszmaul B, et al. (1995) Cilk: An efficient
multithreaded runtime system. In: PPoPP ’95: Proceedings of
the 5th ACM SIGPLAN Symposium on Principles and Practice
of Parallel Programming, pp. 207-216.

Blumofe R and Leiserson C (1994) Scheduling multithreaded
computations by work stealing. In: SFCS’94: Proceedings of
the 35th Annual Symposium on Foundations of Computer
Science, pp. 356-368.

Broquedis F, Aumage O, Goglin B, et al. (2010) Structuring the
execution of OpenMP applications for multicore architectures.
In: IPDPS 2010: Proceedings of the 25th IEEE International
Parallel and Distributed Processing Symposium, pp. 1-10.

Chamberlain B, Callahan D and Zima H (2007) Parallel
programmability and the Chapel language. International Journal of
High Performance Computing Applications 21: 291-312.

Charles P, Grothoff C, Saraswat V, et al. (2005) X10: An
object-oriented approach to non-uniform cluster computing. In:
OOPSLA ’05: Proceedings of the 20th ACM SIGPLAN
Conference on Object Oriented Programming Systems, Languages,
and Applications, pp. 519-538.

Chase D and Lev Y (2005) Dynamic circular work-stealing deque.
In: SPAA ’05: Proceedings of the 17th ACM Symposium on
Parallelism in Algorithms and Architectures, pp. 21-28.

Chen S, Gibbons PB, Kozuch M, et al. (2007) Scheduling threads
for constructive cache sharing on CMPs. In: SPAA ’07:
Proceedings of the 19th ACM Symposium on Parallel Algorithms
and Architectures, pp. 105-115.

Duran A, Corbalan J and Ayguade E (2008a) An adaptive cut-off
for task parallelism. In: SC08: ACM/IEEE Supercomputing
2008, pp. 1-11. Piscataway, NJ: IEEE Press.

Duran A, Corbalan J and Ayguade E (2008b) Evaluation of OpenMP
task scheduling strategies. In: IWOMP ’08: Proceedings of the
International Workshop on OpenMP (eds R Eigenmann and
BR de Supinski), LNCS 5004: 100-110.

Duran A and Teruel X (2010) Barcelona OpenMP Tasks Suite.
http://nanos.ac.upc.edu/projects/bots.

Duran A, Teruel X, Ferrer R, et al. (2009) Barcelona OpenMP Tasks
Suite: A set of benchmarks targeting the exploitation of task
parallelism in OpenMP. In: ICPP’09: Proceedings of the 38th
International Conference on Parallel Processing, pp. 124-131.

Free Software Foundation Inc (2010) GNU Compiler Collection.
http://www.gnu.org/software/gcc/.

Frigo M, Leiserson CE and Randall KH (1998) The
implementation of the Cilk-5 multithreaded language. In: PLDI’98:
Proceedings of the 1998 ACM SIGPLAN Conference on
Programming Language Design and Implementation, pp. 212-223.

Gautier T, Besseron X and Pigeon L (2007) Kaapi: A thread
scheduling runtime system for data flow computations on cluster of
multi-processors. In: PASCO’07: Proceedings of the 2007
International Workshop on Parallel Symbolic Computation,
pp. 15-23.

Hendler D, Lev Y, Moir M et al. (2006) A dynamic-sized
non-blocking work stealing deque. Distributed Computing 18:
189-207.

Intel Corp (2010) Intel Cilk Plus. http://software.intel.com/en-us/
articles/intel-cilk-plus/.

Kukanov A and Voss M (2007) The foundations for scalable
multi-core software in Intel Threading Building Blocks. Intel
Technology Journal 11.

Leijen D, Schulte W and Burckhardt S (2009) The design of a task
parallel library. SIGPLAN Notices: OOPSLA’09: 24th ACM
SIGPLAN Conference on Object Oriented Programming
Systems, Languages, and Applications 44: 227-242.


Liao C, Quinlan DJ, Panas T et al. (2010) A ROSE-based
OpenMP 3.0 research compiler supporting multiple runtime
libraries. In: IWOMP 2010: Proceedings of the 6th
International Workshop on OpenMP (eds M Sato, T Hanawa, MS
Muller, et al.), LNCS 6132: 15-28.

Michael MM (2003) CAS-based lock-free algorithm for shared
deques. In: Euro-Par 2003: Proceedings of the 9th Euro-Par
Conference on Parallel Processing (eds H Kosch, L
Boszormenyi and H Hellwagner), LNCS 2790: 651-660.

Olivier SL, Porterfield AK, Wheeler KB et al. (2011) Scheduling
task parallelism on multi-socket multicore systems. In:
ROSS’11: Proceedings of the International Workshop on
Runtime and Operating Systems for Supercomputers (in
Conjunction with 2011 ACM International Conference on
Supercomputing), pp. 49-56.

OpenMP Architecture Review Board (2008) OpenMP API,
Version 3.0.

Quintin JN and Wagner F (2010) Hierarchical work-stealing. In:
EuroPar’10: Proceedings of the 16th International Euro-Par
Conference on Parallel Processing: Part I, pp. 217-229.
Berlin, Heidelberg: Springer.

Smith BJ (1981) Architecture and applications of the HEP
multiprocessor computer system. In: 4th symposium on Real-Time
Signal Processing, pp. 241-248.

Sundell H and Tsigas P (2005) Lock-free and practical doubly
linked list-based deques using single-word compare-and-swap.
In: OPODIS 2004: 8th International Conference on Principles
of Distributed Systems (ed T Higashino), LNCS, pp. 240-255.

Teruel X, Martorell X, Duran A, et al. (2007) Support for
OpenMP tasks in Nanos v4. In: CASCON’07: Proceedings
of the 2007 Conference of the Center for Advanced Studies
on Collaborative Research (eds KA Lyons and C Couturier),
pp. 256-259.

van Nieuwpoort R, Kielmann T and Bal HE (2000) Satin:
Efficient parallel divide-and-conquer in Java. In: Euro-Par’00:
Proceedings of the 6th international Euro-Par Conference
on Parallel Processing, pp. 690-699. London, UK: Springer.

Wheeler KB, Murphy RC and Thain D (2008) Qthreads: An API
for programming with millions of lightweight threads. In:
IPDPS 2008: Proceedings of the 22nd IEEE International
Symposium on Parallel and Distributed Processing, pp. 1-8.

Authors’ Biographies 

Stephen L Olivier is a PhD candidate in the Department of
Computer Science at the University of North Carolina at
Chapel Hill. His general research interests include
multicore technologies, programming languages, and
performance analysis for high performance computing. His
dissertation research focuses on efficient run time
scheduling techniques for task-parallel computation, using the
Qthreads library as a vehicle for their implementation.
Stephen holds the UNC Computer Science Alumni Fellowship
and was previously a National Defense Science and
Engineering (NDSEG) Fellow. He is also a contributor to the
OpenMP Language Committee.




Allan K Porterfield has been a HPC Senior Scientist at the 
Renaissance Computing Institute (RENCI) in Chapel Hill, 
North Carolina since 2006. Current projects include 
MAESTRO/Qthreads, a lightweight threading run-time to 
support dynamic programming models on a wide variety 
of modern processors. Previously, he spent 17 years at 
Tera/Cray Inc. working in the compiler group for the 
MTA/XMT architecture. He was primarily responsible for 
the instruction simulator and the linking tools, but other 
duties spanned the entire suite of compiler tools. Dr Porterfield
received his PhD from Rice University in 1989 under
Dr Ken Kennedy. 

Kyle B Wheeler received his PhD from the Department of 
Computer Science and Engineering at the University of Notre 
Dame in 2009. Prior to that, he received an MS in Computer 
Science from the University of Notre Dame in 2005, and a BS 
in Computer Science from Ohio University. He is a Senior
Member of Technical Staff at Sandia National Laboratories
and has worked in the scalable system software field for
almost a decade, focusing on lightweight threading environments
for the last six years. His current research focuses on
multi-node task scheduling, particularly applying adaptive
scheduling techniques to that scheduling regime, and on
developing novel task-based collective synchronization
designs using fine-grained synchronization primitives.
Kyle is the primary author of the Qthreads tasking library, and 
is the author of the shared-memory implementation of the 
Portals4 communication interface. 


Michael Spiegel is a postdoctoral research associate at
the Renaissance Computing Institute (RENCI) in Chapel
Hill, North Carolina. He successfully defended his
dissertation on “Cache-Conscious Concurrent Data
Structures” in April 2011 in the Department of Computer
Science at the University of Virginia. His current
research activities focus on large-scale bioinformatics
algorithms and the design and implementation of efficient
memory-aware parallel run-time systems. Michael
is one of the principal contributors to the OpenMx project,
an open source R library for extensible structural
equation modeling. He is interested in embedding concurrency
and parallelism into the core undergraduate
computer science curriculum.

Jan F Prins is a Professor in the Department of Computer 
Science at the University of North Carolina at Chapel Hill. 
He obtained his PhD in 1987 in Computer Science from 
Cornell University. His research interests center on parallel 
computing, including algorithm design, computer architecture,
and programming languages. He collaborates widely
on applications of parallel computing in bioinformatics and 
computational biology, and in the physical sciences and 
engineering. He was a visiting professor at the Institute for 
Theoretical Computer Science at ETH Zurich, in the area of 
scientific computing. His research has been sponsored by 
AFOSR, ARO, DARPA, DOE, EPA, NIH, NSA, NSF, 
ONR and by industry, including IBM, Microsoft, and HPC 
companies. 





